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Abstract 

We present the results of gravitational direct A^-body simulations using the com- 
mercial graphics processing units (GPU) NVIDIA Quadro FX1400 and GeForce 
8800GTX, and compare the results with GRAPE-6Af special purpose hardware. 
The force evaluation of the A^-body problem was implemented in Cg using the 
GPU directly to speed-up the calculations. The integration of the equations of mo- 
tions were, running on the host computer, implemented in C using the 4th order 
predictor-corrector Hermite integrator with block time steps. 

We find that for a large number of particles (A^ ^ 10^) modern graphics processing 
units offer an attractive low cost alternative to GRAPE special purpose hardware. A 
modern GPU continues to give a relatively flat scaling with the number of particles, 
comparable to that of the GRAPE. Using the same time step criterion the total 
energy of the A^-body system was conserved better than to one in 10^ on the GPU, 
which is only about an order of magnitude worse than obtained with GRAPE. For 
A^ ^ 10^ the GeForce 8800GTX was about 20 times faster than the host computer. 
Though still about an order of magnitude slower than GRAPE, modern GPU's 
outperform GRAPE in their low cost, long mean time between failure and the much 
larger onboard memory; the GRAPE-6Af holds at most 256k particles whereas the 
GeForce 8800GTF can hold 9 million particles in memory. 
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1 Introduction 



Since tlie first large scale simulations of self gravitating systems the direct 
A^-body method has gained a solid footing in the research community. At 
the moment A^-body techniques are used in astronomical studies of plane- 
tary systems, debris di scs, stellar clusters, galaxies all the way to simulations 
of the entire universe (IHutl . 120071 ) . Outside astronomy the main areas of re- 
search which utilise the same techniques are molecular dynamics, elementary 
particle scattering simulations, plate tectonics, traffic simulations and chemi- 
cal reaction network studies. In the latter non-astronomical applications, the 
main force evaluating routine is not as severe as in the gravitational A^-body 
simulations, but the backbone simulation environments are not very different. 

The main difficulty in simulating self gravitating systems is the lack of anti- 
gravity, which results in the requirement of global communication; each object 
feels the gravitational attraction of any other object. 



The first astr onomical simula tion of a self gravitating A^-body system was 
carried out by lHolmbergl (Il94ll ) with the use of 37 light bulbs and photoelectric 
cells to evaluate the forces on the individual objects. Holmberg spent weeks in 
order to perform this quite moderate 37-particle simulation. Over the last 60 
or so years many different techniques have been introduced to speed up the 
kernel calculation. Today, such a calculation requires about 50 000 integration 
steps for one dynamical time unit. At a speed of ~ 10 Gflop/s the calculation 
would be performed in a few seconds. 



The gravitational A^-body problem has made enormous advances in the last 
decade due to algori t hmic design. The introduction of digital computer s in the 
arena (jvon Hoernerl . Il963l : lAarseth &: Hoyld . Il964j : Ivan Albadal . Il968l ) led to 
a relatively quick evaluation of mutual particle forces. Advanced integration 
techniques introduced to turn the particle forces in a pre dicted space-time 
traje c tory, opened th e way to predictable theoretical results (lAarseth fc Lecarl . 
19751 : lAarsethl . 119991 ). One of the major developments in the speed-up and 
improved accuracy of the direct A^-body problem was the introductio n of the 
block-time step algorithm (IMakinol . Il99ll : iMcMillan fc Aarsethl . Il993l ). 



In the late 1980s it became q uite cle a r that the advances of modern computer 
technology via Moore's law (iMoord. Il965l) was insufficient to simulate large 
star clusters by the new decade flMakino fc Hutl . Il988l . Il990h . This realization 
brought forward the initiatives employed around the devel opment of special 



hardware for eva l uating the forces between the partic l es (lApplegate et al 



1986 



Taiii et all . Il996l : iMakino fc Taiii 1 19981 : iMakind . l200ll : iMakino et al 



2003f) ■ and of the efficient use of assembler code on general purpo se hardware 
fiNitadori. Makino. fc Hutl . l2006l : iNitadori. Makino. fc Abl . l2007h . 
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One method to improve performance is by parallelising force evaluation Eq. 1^ 

for us e on a Beowulf or cluster computer (with or without dedicated hardware) (IHarfst et al. 



20061 ) ■ a large parallel supercomputer (|Makind . l2002l : iDorband et all l2003l ) 



or for grid operations (iGualandris et al.l . 120071 ) . In particular for distributed 
hardware it is crucial to implement an algorithm that limits communication 
as much as possible, otherwise the bottleneck simply shifts from the force 
evaluation to interprocessor communication. 



A breakthrough in direct-summation A^-body simulations came in the late 
1990s with the developm ent of the GRAPE series of special-purpose computers 
(iMakino &: Taiji Il998l ). which achieve spectacular speedups by implementing 
the entire force calculation in hardware and placing many force pipelines on 
a single chip. The latest special purpose computer for gravitational A^-bodv 
simul ations, GRAPE-6, performs at a peak speed of about 64Tfiop/s (IMakind . 
200lh . 



In our standard setup, one GRAPE-6Af processor board is attached to a 
host workstation, in much the same way that a floating-point or graphics 
accelerator card is used. We use a smaller version: the GRAPE-6Af which 
has four chips connected to a personal workstation via the PCI bus delivering 
a theoretical peak performa nce of ~ 131 G f lops f or systems of up to 128k 
particles at a cost of ~ $6K (IFukushige et al.l . l2005l ). Advancement of particle 



positions [0(A^)] is carried out on the host computer, while interparticle forces 
[0{N^)] are computed on the GRAPE. 

The latest developments in this endeavour is the design and construction of 
the GRAPE-DR, the special p urpose compute r which will break the Pflop/s 
barrier by the summer of 2008 flMakinol . 120071 1^. One of the main arguments 
to develop such a high powere d and re l atively diverse computer is t o perform 
simulations of entire galaxies ( Making . 2005al : [Hoekstra et al . 2007). 



The main disadvantages of these special purpose computers, however, are the 
relatively short mean time between failure, the limited availability, the limited 
applicability, the limited on-board memory to store particles, the simple fact 
that they are basically build by a single research team led by prof. J. Makino 
and the lack of competing architectures. 

The gaming industry, though not deliberately supportive of scientific research, 
has been developing high power parallel vector processors for performing 
specific rendering applications, which are in particular suitable for boost- 
ing the frame-rate of games. Over the last 7 years graphics processing units 
(GPUs) have evolved from fixed function hardware for the support of prim- 
itive graphical operations to programmable processors that outperform con- 
ventional CPUs, in particular for vectorizable parallel operations. Regretfully, 



See http : //grape . astron . s . u-tokyo . ac . jp/grape/ computer/ grape-dr . html 
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the precision of these processors is still 32-bit IEEE which is below the av- 
erage general purpose processor, but for many applications it turns out that 
the higher (double) precision is not crucial or can be emulated at some cost. 
It is because of these developments, that more and more people use the GPU 



for w i der purposes than just for graphics (iFernandd . |2004J : iPharr &: Fernando 



20051 : iBuck et al.l . |200J) . This type of programming is also called general pur- 
pose computing on graphics processing units (GPGPUl!]. Earlier attempts to 
use a GPU for gravitational N-body simulations wer e carried out a pprox imate 



use a Lrr U tor gravitational iN-body simulations wer e carried out a pprox imate 
force evaluation methods using shared time steps iNvland et ai (l2004h . but 



provide little improvement in performance. A 25-fould s peed increa s e com - 



pared to an Intel Pentium IV processor was reported by lElsen et al.l (120061 ). 
but details of their implemtation of the force evaluation alrgorithm are yet 
unclear. 

Using the GPU as a general purpose vector processor works as follows. Colours 
in a computer are represented by one or more numbers. The luminance can 
be represented by just a single number, whereas a coloured pixel may contain 
separate values indicating the amount of red, green and blue. A fourth value 
alpha may be included to indicate the amount of transparency. Using this 
information, a pixel may be drawn. There are many pixels in a frame, and 
ideally, these should be updated all at the same time and at a rate exceeding 
the response time of the human eye. This requires fast computations for up- 
dating the pixels, for example when a camera moves or a new object comes 
into view. Such operations usually have an impact on many or even all pixels 
fast computations are required. But since the majority of pixes do not require 
information from other pixes, processing can be done efficiently in parallel. 
All information required to build a pixel should go through a series of similar 
operations, a technique which is better known as single instruction, multiple 
data (SIMD). There are many different kinds of operations this information 
needs to go through. The stream programming model has been designed to 
make the information go through these operations efficiently, while exposing 
as much parallelism as possible. The stream programming model views all in- 
formations as "streams" of ordered data of the same data type. The streams 
pass through "kernels" that operate on the streams and produce one or more 
streams as output. 

In this paper we report on our endeavour to convert a high precision produc- 
tion quality A^-body code to operate with graphics processor units. In § 2 we 
explain the adopted A^-body integration algorithm, in § 3 we address the pro- 
gramming environment we used to program the GPU, In the sections § 4 and 
§ 5 we present the results on two GPUs and compare them with GRAPE-6Af 
and we discuss a model to explain the GPUs performance. In § 6 we summarise 
our findings, and in the Appendix we present a snippet of the source code in 



see http : //www . gpgpu . org 



4 



Cg. 



2 Calculating the force and integrating the particles 



The gravitational evolution of a system consisting of N stars with masses rrij 
and at position r^ is computed by the direct summation of the Newtonian 
force between each of the stars. The force Fj acting on particle i is then 
obtained by summation of all other A'" — 1 particles 

^ ri-r 

Fi = miSLi ^ rriiG Yl ^ j i '^13 ■ (1) 

j=l,j^i I * 



Here G is the Newton constant. 

A cluster consisting of stars evolves dynamically due to the mutual gravity 
of the individual stars. For an accurate force calculation on each star a total 
of ^N{N — 1) partial forces have to be computed. This 0(A'^^) operation is 
the bottleneck for the gravitational A^-body problem. 

The GPU scheme described in this paper is implemented in the A^-body inte- 
grator. Here particle motion is calculated using a fourth-order, individual-time 
step "Hermite" predictor-corrector scheme (Makino and Aarseth 1992). This 
scheme works as follows. During a time step the positions (x) and velocities 
(v = x) are first predicted to fourth order using the acceleration (a = x) and 
the "jerk" (k = a, the time derivative of the acceleration) which are known 
from the previous step. 

The predicted position (xp) and velocity (vp) are 



Xp = x+ (v + (rft/2)(a+ (dt/3)k))(it, (2) 
Vp = v + {a+ {dt/2)k)dt. (3) 



The acceleration and jerk are then recalculated at the predicted time, using Xp 
and Vp. Finally, a correction is based on the estimated higher-order derivatives: 



a3 = 2(a-aj,) + (k + kp)dt, (4) 
a2 = -3(a-ap) - (2k + kp)cii. (5) 



where 
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a2 = kdt'^/2, (6) 

a3 = kdtV6- (7) 

Which then leads to the new position and velocity at time t + dt. 

X = Xp + (a2/12 + a3/20)rft^ (8) 

v = Vp + (a2/3 + a3/4)rft (9) 



The new a and k are computed by direct summation, and the motion is 
subsequently corrected using the additional derivative information thereby 
obtained. 

A single integration step in the integrator proceeds as follows: 

• Determine which stars are to be updated. Each star has an individual time 
(tj) associated with it at which it was last advanced, and an individual time 
step (dti). The list of stars to be integrated consists of those with the smallest 
U + dti. Time steps are constrained to be p owers of 2, allowing "blocks " of 



many stars to be advanced simultaneously (IMcMillan fc Aarsethl . Il993l ). 

• Before the step is taken, check for system reinitialization, diagnostic output, 
termination of the run, storing data. 

• Perform low-order prediction of all particles to the new time + dti. This 
operation may be performed on the GPU, if available. 

• Recompute the acceleration and jerk on all stars in the current block (using 
the GPU, if available), and correct their positions and velocities to fourth- 
order. 

Note that this scheme is rather simple as it does not include treatment for 
close encounters, binaries or higher order (hierarchical or democratic) stable 
multiple systems. 



3 The programming environment 



The part of the algorithm that executes on the GPU (th e force evaluation) is 



i mplem ented in the Cg computer language (C for graphics. iFernando fc Kilgard 



(120031 ) ■ see Appendix A), which has a syntax quite similar to C. The Cg pro- 
gramming environment includes a compiler and run-time libraries for use with 
the open graphics library (OpenGL)[£] and DirectxQ graphics application pro- 
gramming interfaces. Though originally developed for the creation of real-time 



^ see 'http : / /www . opengl . orgi 

^ see http : //www .microsoft . com/directx 
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Table 1 

Detailed information on the hardware used in our experiments. The first column 
gives the parameter followed by the four different hardware setups (GRAPE-6Af, 
GeForce 8800GTX, Quadro FX1400 and ir iformation about the host computer. The 
information for the GRAPE is taken from Making et al. I (l2003l ). the GPU informa- 
tion is from http://www.nvidia.com. The hardware details are the number of pro- 
cessors pipelines (ripipe), the processor's clock frequency (/gpu = l/^GPu)) the mem- 
ory bandwidth for communication between host and attached processor (l/tbus)i the 
amount of memory (in number of particles, one particle requires 84 bytes, here we 
adopt Ik = 1024). For measured hardware parameters, see Tab. 3. 



data 


GRAPE-6Af 


8800GTX 


FX 1400 


Xeon 


unit 


^pipe 


48 


128 


12 


1 




/gpu 


90 


575 


350 


3400 


MHz 


l/ibus 


33.8 


86.4 


19.2 


NA 


GB/s 


Memory 


128k 


9362k 


1562k 




particles 



special effects without the need to program directly to the graphics hardware 
assembly language, researchers soon recognised the potential of Cg and started 
to apply it not only to high-performance graphics but also to a wide variety of 
"gene ral-purpose computing" problems (jFernandd . l2004l : iPharr fc Fernando! . 
2005h . 



3.1 Mapping the N-hody problem to a GPU 



The challenge in the implementation of an efficient A^-body code on a GPU 
lies in the mapping of the algorithm and the data to graphical entities sup- 
ported by the Cg language. Particle data arrays are represented as "textures" . 
Normally, textures are used to represent pixels colour attributes with one sin- 
gle component (luminance, red, green, blue or alpha), three components (red, 
green and blue) or four components (red, green, blue and alpha). In our im- 
plementation we use multiple textures to represent the input and output data 
of N particles, as follows: 

• Input: mass (A^), position (3A^) and velocity (3A^) 

• Output: acceleration (3A^), jerk (3A^) and potential (A^) 

All values are represented as single precision (32-bit) floating point numbers 
for a total of 21 floats or 84 bytes per particle. In Appendix A we present a 
snippet of the source code in Cg, showing the implemented force evaluation 
routine. With the 768 Mbyte on-board memory of the GeForce 8800GTX it 
can store about 9 million particles, whereas the GRAPE-6Af can store only 
128k (see Tab. 1). 



7 



Transferring data from CPU to GPU is accomplished through the definition 
of textures, which can either be read-only or write-only, but not both at the 
same time. The data structures in the CPU are then copied onto appropriately 
defined textures in the graphics card's memory. Obtaining the results from 
GPU to CPU is done by reading back the pixels from the appropriate rendering 
targets into data structures on the host CPU. Therefore the output textures 
(acceleration, jerk and potential) are represented by a double-buffered scheme, 
where after each GPU computation the textures are swapped between reading 
and writing. There is some additional overhead (of order N) for this operation 
which has to be performed every block time step. 

Conventionally, graphics cards render into a "frame buffer" , a special memory 
area that represents the image seen on a display. However, a frame buffer is 
unsuitable for our purposes as the data elements in this buffer are "clamped" 
to value ranges that map the capabilities of the display. Invariably this means 
that 32-bit real vectors are reduced in resolution and therefore in accuracy 
too. This is perfectly fine for visual displays where the number of colours af- 
ter clamping are still 2^^ 16 million), sufficient to make two neighbouring 
colours indiscernible to the human eye. However, this is unacceptable for scien- 
tific production calculations. The workaround is to create an off-screen frame 
buffer object and instruct GPU programs to render into these rather than to 
the screen. Off-screen frame buffers support 32-bit floating point values and 
are not clamped and therefore preserve their precision. 

The GPU has two main kernel operations available in programmable graphics 
pipelines, these are a "vertex shader" and a "fragment shader". Our imple- 
mentation only makes use of the fragment shader pipeline as it is better suited 
for the kind of calculations in the N-body problem and because the fragment 
pipeline in general provides more processing poweiT^. The host CPU is re- 
sponsible for allocating the input textures and frame buffer objects, copy the 
data between CPU and GPU, and binding textures that are to be processed 
by kernels. The lower order prediction and correction of the particle positions 
is done on the host CPU. In Tab. 1 we summarise the hardware properties of 
the two adopted CPU's and the GRAPE. 



4 Results 

To test the various implementations of the force evaluator we perform several 
tests on different hardware. For clarity we perform each test with the same 

^ Before the 8800GTX family of GPUs, vertex programs and fragment programs 
had to execute on distinct processing units. The 8800GTX is the first generation of 
GPUs where this distinction no longer exists and the two are unified. 
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Table 2 

Results of the performance measurements for a Plummer sphere with N equal mass 
particles initially in virial equilibrium for 0.25A^-body time units (from t = 0.25 
to t = 0.5) using a softening of 1/256. In the first column we list the number of 
particles, followed by the timing results of the GRAPE in seconds. In the last column 
we give the timing results for the calculation without an attached processor. The 
GRAPE (second column) was measured up to 128k particles, because the on-board 
memory did not allow for larger simulations. Simulations on the FX1400 and the 
host computer were limited for practical reasons. 



N 


GRAPE-6Af 


8800GTX 


FX1400 


Xeon 


256 


0.07098 


2.708 


3.423 


0.1325 


512 


0.1410 


8.777 


10.59 


0.5941 


1024 


0.3327 


17.46 


20.20 


2.584 


2048 


0.7652 


45.27 


54.16 


10.59 


4096 


1.991 


128.3 


157.8 


50.40 


8192 


5.552 


342.7 


617.3 


224.7 


16384 


16.32 


924.4 


3398 


994.0 


32768 


51.68 


1907 


13180 


4328 


65536 


178.2 


3973 


40560 


19290 


131072 




8844 






262144 




22330 






524288 




63960 







realization of the initial conditions. For this we vary the number of stars from 
N = 256 particles with steps of two to half a million stars (see Tab. 2). Not 
each set of initial condition is run on every processor, as the Intel Xeons, for 
example, would take a long time and the scaling with is unlikely to change 
as we clearly have reached the CPU-limited calculation regime (see §5). 



The initial conditions for each set of simulations were genera ted by randomly 
selecting the stellar positions and velocities according to the iPlummerl (jl91ll ) 
distribution using the method described by lAarseth et al.l (119741 ). Each of the 
stars were given the same mass. The initial particle representations were scaled 
to virial equilibrium before starting the calculation. 



Each set of initial condi tions is run from t = to t = 0.50 time units 
( Heggie &: Mathieu . 1986 jlZ, but the performance is measured only over the 
last quarter of an A^-body time unit to reduce the overhead for reading the 
snapshot and the initialisation of the integrator. The maximum time step 



see also littp://eii.wikipedia.org/wiki/Natural_uiiits#N-body_units 
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for the particles was 0.125, to guarantee that each particle was evaluated at 
least twice during the course of the simulation. The force calculations were 
performed by adopting a softening of 1/256 for all simulations. 

For our performance measurements we have used four nodes of a Hewlett- 
Packard xw8200 workstation with a dual Intel Xeon CPU running at 3.4 GHz 
and either the GRAPE, a Quadro FX1400 or GeForce 8800GTX graphics card 
in the PCI Express (16 x) bus. The cluster nodes were running a Linux SMP 
kernel version 2.6.16, Cg version 1.4, graphics card driver version 1.0-9746 and 
the OpenGL 2.0 bindings. 

In Figure 1 we show the timing results of the A^-body simulations. The FX1400 
is slower than the general purpose computer over the entire range of in our 
experiments. The bad performance of the FX1400 is mainly attributed to the 
additional overhead in communication and memory allocation. For A^ ^ 10^ 
the GeForce 8800GTX GPU is slower than the host computer but continues 
to have a relatively flat scaling, comparable to the GRAPE-6, whereas the 
host has a much worse (oc A^^) scaling. The scaling of the compute time of 
the GPU is proportional to that of the GRAPE (oc A^^/^), but the latter has 
a smaller offset by about an order of magnitude. This is mainly caused by 
the efficient use of the GRAPE pipeline, which requires fewer clock cycles per 
force evaluation compared to the GPU (see §6). 



5 Performance modelling of the GPU 



In modelling the performance of t he GPU we adopt the model proposed by 
Makinol fl2002h : iHarfst et al.1 fl2006l ) but tailored to the host plus GPU and to 
the GRAPE architecture. 



The wall clock time required for advancing the nbiock particles in a single block 
time step in the A^-body systems is 



^step ^host ~l~ ^forco ~l~ 



(10) 



Here thost = ^pred + ^corr is the time spend on the host computer for predict- 
ing and correcting the particles in the block, tforce is the time spend on the 
attached processor and tcomm is the time spend communicating between the 
host and the attached processor. We now discuss the characteristics of each of 
the elements in the calculation for tstep- 



Host operation. The predictions and corrections of the particles are calcu- 
lated on the host computer, and the time for this operation is directly related 



10 
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Fig. 1. Timing of several implementations of the gravitational A'"-body simulations 
for N = 256 particles to iV = 512k particles (only for the 8800GTX, the others up 
to 64k) over one iV-body time unit. The 8800GTX are represented with open circles 
connected with a solid curve, the GRAPE is given by bullets with dashed line. The 
thin dashed (triangles) line and thin dotted (squares) lines give the results of the 
calculations with the FX1400 and with only the host computer. Note that in timings 
in Tab. 2 were multiplied by a factor of four to estimate the compute time for one 
dynamical time unit, rather than the 1/4*^^ over which the timing calculations were 
performed. 

to the speed of the host processor icpu, the number of operations in the pre- 
diction step ripred and in the correction step ricorr- The total time spend per 
block step then yields 



^pred — ^pred^cpu-^) 



(11) 



for the prediction and 

tcoTT — ^corr^cpu-^- (l^) 

for the correction. The number of operations per prediction step ripred — 300 
and for the correction ricorr — 1000. This operation could be performed on 
the GPU, though the GRAPE is not designed for the prcdictior and corrector 
calculation. For a fair comparison between the GRAPE and the GPU and in 
order to preserve high accuracy we performed these calculations on the host. 
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Communication. The time spend communicating between the host and 
the attached processor is expressed by the sum of the time needed to send 
^send particles to the acceleration hardware and the time needed to receive 
rirec particles from the acceleration hardware: 

^comm ^send'^send'^send ~l~ '7rec^rec'^send- (-^'^) 



Here ?7send4end and r/rec^rec are the time needed to send and receive one par- 
ticle, respectively. For the computation without the hardware acceleration 
^comm = 0, since the forces between all particles are calculated locally. For 
the GRAPE and the GPUs, however, a considerable amount of time is spend 
in communication. For the GRAPE sending data is equally fast as receiving 
data, i.e: tsend = ^rec = ^bus- Sending data to the GPU is considerably slower 
than receive data (see Tab. 3). 



The two send and receive efficiency factors ?7send and T^rec, are the product 
of the overhead r]o and the number of by tes per particle that has to be 



send or received. The overhead 



188 (IFukushige et all . [20051) for each 



of the attached processors. Since for the GRAPE the send and receive oper- 
ation are equally expensive we can just count the number of bytes that has 
to be transported per particle, which for the GRAPE hardware is 72 bytes 



flFukushige et all . l2005l : iHarfst et al.l . [20061). For the GRAPE we then write 



'7send^send ~t~ '7rec^rec 



72 X 188tbus- 



For the GPU r/send > Vrec (see Table 3 for the measured values). The additional 
overhead rjo is the same as for the GRAPE, but per particle the number of 
bytes to send is different from than the number to receive. As we discussed in 
§ 3.1 a total of 56 bytes has to be send from the host to the GPU, whereas only 
28 bytes are received. For the GPU we then write rjsend = 56 x 188, whereas 
r/rec = 28 X 188. 

In addition to the difference in the speeds for sending and receiving data, the 
GPU suffers from an additional penalty. The GRAPE sends the particles in 
the block (rigend = ^^-biock), whereas due to internal memory management the 
GPU has to send and receive all particles ngend = N (see § 3.1). This efficiency 
loss is quite substantial, and will probably be reduced when we use CUDA as 
programming environment (see §6). 

For the adopted (Hermite predictor-corrector block time-step) integration 
scheme the number of particles in a single block nbiock cannot be determined 
implicitly, though theoretical arguments suggest nbiock oc N'^^^ . Instead of us- 
ing this estimate We fitted the average number of particles in a block time 
step. This fit was done with the equal mass Plummer sphere initial conditions 
running on GRAPE and run over one dynamical (N-body) time unit. The 
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average number of particles in a single block is then 

nbiock ^ 0.20n°-^\ 



(14) 



Calculation. The time spend by the hardware acceleration (tforce) is directly 
related to the speed of the dedicated processor (tcpu), the number of pipelines 
per processor (?ipipe) and the number of operations for one force evaluation 
(r/fe ~ 60). 

^force = ''^fc^^block^GPu/'^pipe- (15) 



The details of the different hardware are presented in Tab. 1 and the measured 
values are in Tab. 3 The GRAPE has a vector pipeline for each processor 
which allows a more efficient force evaluation than the CPU's, the number of 
operations per force evaluation for the GRAPE therefore rj^e — 0(1). 

In order to enable hardware acceleration on our A^-body code we had to in- 
troduce a number of additional operations, like reallocating arrays, which give 
rise to an extra computation overhead. For the calculations with the host 
computer without hardware acceleration we adopt rjfe — 180, a factor of three 
larger than for the CPUs. 



Total performance. The total wall-clock time spent per dynamical (N- 
body) time unit is then 

t ^steps^step- (-^6) 



Here we fitte d to number of block s t eps p er dynamical (N-body) time units. 
According to Makino fc Hut ( 1988 . 1990) '^steps oc n^^^. We measured the 



number of block time steps using the equal mass Plummer distributions as 
initial conditions, using the GRAPE enabled code and fitted the result: 



^steps 



247A^' 



0.35 



(17) 



In fig. 2 we compare the results of the performance model with the measure- 
ments on the workstation without additional hardware (squares) and with 
three attached processors; a single GRAPE-6Af processor board (bullets), an 
FX 1400 (triangles) and the newer GeForce 8800GTX (circles). Note that the 
measurements in Tab. 2 were multiplied by a factor four to compensate for 
the fact that we performed our timings only over a quarter A^-body time unit. 
Though these curves are not fitted, they give a satisfactory comparison. 
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Table 3 

Measurements for the various hardware using in this paper. The first hne gives the 
time spent by the host computer for predicting and correcting a single particle. The 
second row is for calculating the force between two particles. The last two columns 
give the time to send a single particle to, and to receive a single particle from the 
attached hardware. For the calculations with only the host computer this operation 
is not available. In particular the communication with the GPUs turns out to be 
relatively slow. 



pMraiii 


GEAPE-GAf 


8800GTX 


FX ;i 100 




Xcoii 


^host 


3.82 X 10-^ 


3.82 X lO-'^ 


3.82 X 10- 


-7 


3.82 X 10-^ 


^fe^force 


1.11 X 10-^ 


1.04 X 10-'^ 


1.72 X 10" 


■7 


5.29 X 10-^ 


%end^send 


2.00 X lO"'^ 


1.76 X 10-5 


1.89 X 10" 


-5 


NA 


'yrec^rec 


2.00 X lO"'^ 


5.97 X 10"^ 


5.98 X 10" 


-6 


NA 



The largest discrepancy between the performance model and the measure- 
ments can be noticed for the FX1400 GPU, which, for N ^ 10'^ seems to 
perform considerably less efficiently than expected according to the perfor- 
mance model. Part of this discrepancy, though not explicitly mentioned in § 5 
is in part the result of a hysteresis effect in the communication of both CPU's. 
For the 8800GTX, however, this effect is less evident, but still present. Both 
CPU's tend to have a maximum communication speed for blocks of 0.5 Mbyte 
(about 6000 particles). An additional effect which causing performance loss 
on the FX 1400 is the increase in the number of block time steps. This number 
continued to increase beyong our measurements performed with GRAPE (see 
Eq. 14). 

The numbers listed in Tab. 3, and used in our performance model, are the 
optimum values. The communication speed drops by about a factor of two for 
much larger amounts of data transfer to and from the GPU. For the FX1400, 
this drop in communication is considerable, whereas for the 8800GTX it results 
in a smaller performance loss (mainly due to the larger number of processor 
pipelines). The discrepancy for the GRAPE calculation with low is the 
result of neglecting the limited size of the processor pipeline in the performance 
model and due to the irregular behavior of the number of particles in each 
block time step. 

In fig. 3 we present the speed-up for the various hardware configurations, com- 
pared to running on the host workstation. Here it is quite clear that for low 
A^ the GPU's do not give a appreciable speedup, but for a large number of 
particles, the GeForce 8800GTX gives a speedup of at least an order of mag- 
nitude, but not as much as the GRAPE. The latter, however, will not be able 
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Fig. 2. The results of the above described performance model (thick lines) over-plot- 
ted with the results of the measurements for the three attached processors (symbols). 
The bullets represents the results from a single GRAPE-6Af processor board, the 
squares give the host workstation, the circles are for the GeForce 8800GTX and the 
triangles give the FX 1400 graphics processor. 

to perform simulation of more than 128k particleJ^. 



6 Discussion 



We have successfully implemented the direct gravitational force evaluation 
calculation using Cg on two graphics cards, the NVIDIA Quadro FX1400 and 
the NVIDIA GeForce 8800GTX, and compared their performance with the 
host workstation and the GRAPE-6Af special purpose computer. 

For ^ 10^ objects the workstation outperforms the GPUs. This is mainly 
due to additional overhead introduced by the communication to the GPU 
and memory allocation on the GPU. For a larger number of particles the 
more modern GPU (8800GTX) outperforms the workstation by up to about 
a factor of 50 (for 9 million particles). Such a large number of particles cannot 

^ Due to a defective chip on our GRAPE-6 the on-board memory was reduced 
from 128k particles to 64k particles. The latest GRAPE-6Af are equipped with 
256k particles of memory. 



15 



1000 F 



100 r 



10 r 



0.1 



1 1 1 1 1 1 1 1 

8800GT> 
FX140( 
- nPAPF-RA 


1 1 1 1 1 1 1 1 

) 

f _. 

1 


1 1 1 1 1 1 1 1 


1 1 1 1 1 1 1 1 


1 1 1 1 1 1 1 L 

- 








o 












- 


: ■ ■ 1 

1 1 1 ■ ■ 


I ■ ■ m 

1 1 1 


▲ 

1 1 1 


1 1 1 





100 1000 10000 100000 

N 



1e+06 



1e+07 



Fig. 3. The speedup of the two GPUs and the GRAPE with respect to the host 
workstation as a function of the number of particles. The (lower) dotted curve is 
for the Quadro FX1400, the solid curve (middle) gives the timing for the GeForce 
8800GTX and the top line (dashes) represents the GRAPE. 



be simulated on the GRAPE-6Af, due to memory limitations. For up to 256k, 
the maximum number of particles that can be stored on the GRAPE, the 
8800GTX is slower than the GRAPE by about an order of magnitude. Still, 
at this particle number the GPU is faster than the workstation by an order of 
magnitude. 

The GPU has lower accuracy compared to the GRAPE or the host work- 
station. The GRAPE-6 uses 24-bit mantissa for calculating the differential 
position, and 64bit fixed point format for accumulation. The pipeline for the 
time derivative is designed with 20bit mantissa and 32 bit fixed-point notation 



for the final accumulation (IMakino et al.l . |2003| ). In principle the NVIDIA ar- 



chitecture should not be able to achieve similar precision, but would fall short 
in precision by about an order of magnitude compared to the GRAPE-6. The 
average mean error in the energy is (1.7 ± 1.6) x 10~^ for the 8800GTX and 
(5.1 ±0.56) X 10^^ for the FX1400 (averaged over the simulations for = 256 
to = 64k), whereas for the GRAPE we measured (1.9 ± 1.2) x 10~^, which 
is comparable to the mean error on the host. Both the host and GRAPE pro- 
duce an energy error which is about an order of magnitude smaller than that 
of the GPUs. 
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The precision of the GPU is regretfully unlikely to increase anytime soon, 
as the higher precision is not required for graphics applications and it would 
imply a considerable redesign of the hardware. But we could improve the 
accuracy of the GPU even further by sorting the forces on size before adding 
them, summing the smallest forces first. 

The fixed point notation in the GRAPE-6 allows for a much more efficient use 
of clock cycles, allowing effectively one operation per clock cycle, whereas the 
NVIDIA architecture requires more cycles. This turns out to be an important 
reason why the 8800GTX is slower than the GRAPE-6. 

The main advantage of the GPU over that of the dedicated GRAPE hardware, 
is the much larger memory, the wider applicability and the much lower cost 
of the former. The large memory on the GPU allows simulations of up to 
about 9 million particles, though one has to wait for about two years for one 
dynamical time scale. 

In theory the 8800GTX should be able to outperform the GRAPE-6Af, but due 
to relatively inefficient memory access and additional overhead cost, which is 
not present in the GRAPE hardware, many clock cycles seem to be lost. With 
a more efficient use of the hardware the GPU could, in principle, improve 
performance by about two orders of magnitude. For the next generation of 
GPUs we hope that this efficiency bottleneck will be lifted. In that case, 
the GPU would outperform GRAPE by almost an order of magnitude. Note, 
however, that the GRAPE-6 is based on 5 year old technology, and the next 
generation GRAPE is likely to outperform modern GPUs by a sizable margin. 

These current bottlenecks in the GPU may be reduced using the compute uni- 
fied device architecture (CUDA0 programming environment, which is sup- 
posed to provide an improved environment for general purpose programming 
on the GPU. In fig. 4 we present the possible future performance assuming 
that the communication additional overhead on the GPU is lifted, the clock 
cycles are used more efficiently without any assumptions of improved hard- 
ware speed. In the first step we simple reduce communication to blocks rather 
than having to transport all particles each block time step (solid curve). This 
relatively simple improvement can already be carried out using CUDA. The 
second optimalization (dashed curve in fig. 4) is achieved when, in addition to 
reducing the communication we also carried the predictor and corrector steps 
to the GPU. This improvement, however, may be associated with a quite se- 
vere accuracy penalty. For both improvements we used the performance data 
for the current design 8800GTX. Further improvement can be achieved when, 
in addition to more efficient communication the force pipeline can be repre- 
sented more efficiently by the shader pipeline, like is done on GRAPE. The 



see http : //developer . nvidia . com/ cuda 
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Fig. 4. Prospective of future CPU and GPU performance, based on the model from 

§ 5. The two thin curves with squares and circles give the measured performance of 
the CPU and 8800GTX CPU, respectively. The thick solid curve gives a prediction 
of the performance for the 8800GTX in which only blocks of particles are communi- 
cated with the GPU. The dashed curves gives in addition the effect of carrying the 
predictor /corrector calculation to the GPU. The doted curve gives the performance 
of a hypothetical 8800GTX-like architecture for which in addition the processor 
pipeline would be used more efficiently (ryfe = 1). 

result of this hypothetical case would improve performance by more than a 
factor 100 compared to the workstation over the entire range of N. 
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Appendix A 



The N-body code presented in this paper consists of a part implemented in C 
(running on a CPU) and a part implemented in Cg (running on the GPU). In 
this appendix we show the routine that evaluates the acce leration, jerk an d 
potential in Cg (which was based on a tutorial available from lGoddekd ( 20051 )). 
The C code which handles communication between CPU and GPU and sup- 
porting data structures is not presented here. A copy of the entire working 



version of the code is available via |http : / /modesta . science . uva . nl 
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void conipute_acc_jerk_aiid_pot( 

in float2 coords : TEXCODRDO, 

out floats acc : CDLORO, 

out float4 jerkAndPot : COLORl, 
uniform samplerRECT accTexture, 
uniform samplerRECT j erkAndPotTexture , 
uniform samplerRECT massTexture, 
uniform samplerRECT posTexture, 
uniform samplerRECT velTexture, 
uniform float eps2, 
uniform float otherParticle, 
uniform float texSizeX, 
uniform float texSizeY, 
uniform float offset) 



// 2D texture coordinate of this particle 

// Output texture with acceleration 

// Output texture with jerk and potential 

// Input texture with all particles' acceleration 



// >> >» »> »> »> 

// >> >» »> »> »> 

// »> >» »> »> »> 

// >> »» »» »» »> 

// Softening parameter 
// Index to other particle 
// Width of all textures 
// Height of all textures 
// Number of unused texture elements 



jerk and potential 
mass 

position 
velocity 



float coordslD, newCoordslD, otherMass, 

r2, xdotv, r2inv, rinv, rSinv, rSinv 
float2 newCoords; 

floats pos, otherPos, vel, otherVel, dx, 



xdotvrSinv; 
dv, thisAcc, this JerkAndPot; 



// Get data from the textures 

acc = texRECT (accTexture, coords). rgb; 

jerkAndPot = texRECT(j erkAndPotTexture, coords) . rgba; 

pos = texRECT (posTexture, coords). rgb; 

vel = texRECT (velTexture, coords) .v7 .texrgb; 



// Convert the 2D texture coordinate to ID and increase with otherParticle 

// to obtain the coordinate of this iteration's other particle. Because our 

// textures are defined as samplerRECT, texture elements must be addressed 

// as (x+0.5,y+0.5) . When converting to ID, we must compensate for this offset). 

coordslD = round(coords.y-0.5)*texSizeX + round(coords.x-0.5) ; 

newCoordslD = coordslD + otherParticle; 



// Skip over unused texture elements 

if (newCoordslD + offset > texSizeX*texSizeY - 1) 

newCoordslD = newCoordslD - (texSizeX*texSizeY - offset) ; 



// Convert the other particle's ID coordinate to 2D. As above, we must add 
// 0.5 to obtain correct texture element coordinates. 
newCoords = 0.5 + float2( frac(newCoordslD/texSizeX)*texSizeX, 

floor (newCoordslD/texSizeX) ); 

// Get the position, velocity and mass of this iteration's other particle 
OtherPos = texRECT (posTexture, newCoords) .rgb; 
OtherVel = texRECT (velTexture, newCoords) .rgb; 
OtherMass = texRECT (massTexture, newCoords) .r; 



// Compute acceleration, jerk and potential 

dx = otherPos-pos ; 

dv = otherVel-vel ; 

r2 = eps2 + dot(dx,dx); 

xdotv = dot (dx , dv) ; 

r2inv = 1.0/r2; 

rinv = sqrt (r2inv) ; 

rSinv = r2inv*rinv; 

r5inv = r2inv*r3inv; 

xdotvr5inv = 3.0*xdotv*r5inv; 

thisAcc = 0therMass*r3inv*dx; 

this JerkAndPot = otherMass* (r3inv*dv - xdotvr5inv * dx) ; 
acc = acc + thisAcc; 

j erkAndPot . rgb = j erkAndPot . rgb + this JerkAndPot ; 
jerkAndPot. a = jerkAndPot. a - otherMass*rinv; 
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